Let’s get started with ggplot2

An example: GDP and Life Expectancy

library(ggplot2)
library(gapminder)  # 'gapminder' package contains the data
gapminder           # Let's take a look at the data

Another look at the data frame

str(gapminder)      # str() is a good way to look at the data frame
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Simple Scatterplot

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) # nothing to plot yet!

Simple Scatterplot

We can store the plot as an object so that we can modify it and add layers later:

p <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) 
  • We create an object p that holds the data and mappings we want ggplot to plot.
  • x = gdpPercap and y = lifeExp define what goes on the x and y axes.
  • These are “aesthetic mappings”: they connect variables in the data to visual elements of the plot.

  • Begin a plot with ggplot().
  • It creates a coordinate system that you add layers to.
  • The first argument is the dataset.
  • Next, add layers to the plot.
  • In our example, geom_point() adds a layer of points.
  • There are many geom functions, each drawing a different kind of layer.
  • Every geom function takes a mapping argument.
  • The mapping defines how variables in your dataset are mapped to visual properties.
  • It is always paired with aes().
  • The x and y arguments of aes() specify which variables to map to the axes.
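The layering idea can be checked on a small scale. This is a minimal sketch on a made-up data frame (toy, with hypothetical columns income and life, standing in for gapminder). It assumes the plot object exposes its layers as a list via $layers.

```r
library(ggplot2)

# Made-up data standing in for gapminder (hypothetical values)
toy <- data.frame(income = c(500, 2000, 8000, 32000),
                  life   = c(45, 58, 69, 79))

# ggplot() alone sets up data and mappings, but no layers
base <- ggplot(toy, aes(x = income, y = life))
n_layers_before <- length(base$layers)

# Adding a geom adds one layer; the object `base` itself is unchanged
with_points <- base + geom_point()
n_layers_after <- length(with_points$layers)

c(before = n_layers_before, after = n_layers_after)
```

Printing base draws empty axes, while printing with_points draws the scatter plot.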

Simple Scatterplot

p + geom_point()   # Now we tell ggplot that we want a scatter plot

Simple Scatterplot

ggplot( data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
        geom_point()   

  # Of course, we can write that in one swoop

Log Transformation

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +   # no log10() here: the scale below does the transformation
  geom_point() +
  scale_x_log10()
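scale_x_log10() by itself is enough to put the x positions on a log10 scale while keeping the axis labeled in the original units. The sketch below checks this on a made-up data frame (toy and its columns are hypothetical), using layer_data() to look at the positions the layer actually draws.

```r
library(ggplot2)

# Made-up data: incomes that are exact powers of ten
toy <- data.frame(income = c(100, 1000, 10000),
                  life   = c(50, 65, 80))

p_log <- ggplot(toy, aes(x = income, y = life)) +
  geom_point() +
  scale_x_log10()

# layer_data() returns the data as drawn by layer 1:
# the x positions are already log10(income)
x_drawn <- layer_data(p_log, 1)$x
x_drawn
```

Wrapping the variable in log10() inside aes() as well would therefore take the logarithm twice.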

Let’s keep that scale setting

p <- p + scale_x_log10()

Map continent variable to aesthetic color

p + geom_point(aes(color = continent))

To recap: full plot command thus far

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent)) + scale_x_log10() 

Note, we put the aes() in the geom_point() element. We will see in a bit why.

Reduce overplotting

p + geom_point(aes(color = continent), alpha = 0.3, size=3)  

  # Setting transparency of points

Adding fitted curve

p + geom_point(aes(color = continent)) + geom_smooth()  
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

  # by default, geom_smooth() picks the method from the data size: loess for fewer than 1,000 observations, gam otherwise (as here)

Adding fitted curve

p + geom_point(aes(color = continent)) + 
  geom_smooth(color="black", lwd=2, se=FALSE)  
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

   # removing the confidence intervals

We could exchange the order of the layers

p + geom_smooth(color="black", lwd=2, se=FALSE) + 
  geom_point(aes(color = continent)) 
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Use a linear fit instead

Available methods include lm, glm, gam, loess, and rlm (from the MASS package).

p + geom_point(aes(color = continent)) + geom_smooth(method="lm")  

Smooth fit by continent

p + geom_point(aes(color = continent)) +
  geom_smooth(lwd = 2, se = FALSE, aes(color = continent))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Now all layers use the continent grouping

# We could add the aes() grouping to the overall graph p
p <- p + aes(color = continent)
p + geom_point() +
  geom_smooth(lwd = 2, se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Why another color=continent?

# Our original plot command:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() + scale_x_log10() 

# A single smoothed line through all points:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() + scale_x_log10() + geom_smooth()

# Using the color aesthetic for the smoothing as well as the scatter points
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() + scale_x_log10() + geom_smooth(lwd=2, se=FALSE)

# Still single black smoothed line but now points are colored by continent:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(aes(color = continent)) + scale_x_log10() +
  geom_smooth(color="black")
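The effect of where color is mapped can be verified directly. This is a minimal sketch on made-up data (toy, x, y, and g are hypothetical names). It uses method = "lm" only to keep the example small and counts how many separate fits the smoothing layer produces.

```r
library(ggplot2)

# Made-up data: two groups with opposite trends
toy <- data.frame(x = rep(1:5, times = 2),
                  y = c(1.1, 1.9, 3.2, 3.8, 5.1, 5.0, 4.2, 2.9, 2.1, 0.9),
                  g = rep(c("a", "b"), each = 5))

# color mapped only inside geom_point(): the smoother sees a single group
p_local <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = g)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color mapped in ggplot(): every layer inherits it, so one fit per group
p_global <- ggplot(toy, aes(x = x, y = y, color = g)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is the smoother; count the distinct groups it fitted
n_fits_local  <- length(unique(layer_data(p_local, 2)$group))
n_fits_global <- length(unique(layer_data(p_global, 2)$group))
c(local = n_fits_local, global = n_fits_global)
```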

Grammar of Graphics

The Grammar of Graphics

  • ggplot2 is based on a “grammar” of graphics, an idea that originated with Wilkinson (2005).

Main principles

  • There are a few main principles:
    • Graphics = distinct layers of grammatical elements (or grammar rules) that map pieces of data to geometric objects (like lines and points) and attributes (like color and size).
    • If necessary, additional rules about scales, projections in a coordinate system, and data transformations can be added.
    • Plots arise through aesthetic mapping.
  • The grammar produces “sentences” (mappings of data to objects), but they can easily be garbled if you define poor mappings.

Three key grammatical elements

Element      Description
Data         The dataset being plotted.
Aesthetics   The scales onto which we map our data.
Geometries   The visual elements used for our data.

  • Every ggplot2 plot has these three key components.

All seven grammatical elements

Element       Description
Data          The dataset being plotted.
Aesthetics    The scales onto which we map our data.
Geometries    The visual elements used for our data.
Facets        Plotting subsets of the data.
Statistics    Statistical representations of our data to aid understanding.
Coordinates   The space on which the data will be plotted.
Themes        All non-data ink.

A diagram of the graphical elements

ggplot2 layers: data

gapminder

ggplot2 layers: aesthetics

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))


Of the columns country, continent, year, lifeExp, pop, and gdpPercap, we map gdpPercap to x and lifeExp to y.

ggplot2 layers: geometries

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
        geom_point(alpha=0.5, size=3)

ggplot2 layers: facets

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
        geom_point(alpha=0.5, size=3) + 
        facet_grid( . ~ continent)

ggplot2 layers: statistics

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
        geom_point(alpha=0.5, size=3) + 
        facet_grid( . ~ continent) + 
        geom_smooth(color="black", lwd=1, se=TRUE)  
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot2 layers: coordinates

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
        geom_point(alpha=0.5, size=3) + 
        facet_grid( . ~ continent) + 
        geom_smooth(color="black", lwd=1, se=TRUE) + 
        scale_x_log10("GDP per Capita") +
        scale_y_continuous("Life Expectancy in Years")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot2 layers: theme

Some complete themes: theme_tufte() (from the ggthemes package), theme_classic(), theme_minimal()

library(ggthemes)
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
        geom_point(alpha=0.5, size=3) + 
        facet_grid( . ~ continent) + 
        geom_smooth(color="black", lwd=1, se=TRUE) + 
        scale_x_log10("GDP per Capita") +
        ylab("Life Expectancy in Years") + 
        theme_tufte() + theme(legend.position="none")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot2 layers: theme

library(ggthemes)
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
        geom_point(alpha=0.5, size=3) + 
        facet_grid( . ~ continent) + 
        geom_smooth(color="black", lwd=1, se=TRUE) + 
        scale_x_log10("GDP per Capita", 
                      labels = scales::trans_format("log10", 
                      scales::math_format(10^.x))) + 
        ylab("Life Expectancy in Years") + 
        theme_economist() + 
        theme(legend.position="none") + 
        ggtitle("The relationship between wealth and longevity")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The Plot-Making Process in ggplot

Understanding the layers of ggplot2

Recall: the three key grammatical elements:

Element      Description
Data         The dataset being plotted.
Aesthetics   The scales onto which we map our data.
Geometries   The visual elements used for our data.

Let’s take a closer look at these now.

Recall: Aesthetics vs. Attributes

  • An attribute is a fixed setting (color, shape, size, etc.) that does not depend on the data.
  • In contrast, in the aesthetics layer we map variables in the data onto visible aesthetics.
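A common slip is to put a fixed setting inside aes(). This is a minimal sketch of the difference on a made-up data frame (toy is hypothetical), using layer_data() to inspect the colour each layer ends up drawing.

```r
library(ggplot2)

toy <- data.frame(x = 1:3, y = c(2, 4, 3))

# Attribute: a fixed setting, given outside aes()
p_set <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(color = "red")

# Aesthetic: inside aes(), "red" is treated as data (a one-level
# categorical variable) and receives a colour from the default palette
p_map <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = "red"))

colour_set <- unique(layer_data(p_set, 1)$colour)
colour_map <- unique(layer_data(p_map, 1)$colour)
colour_set
colour_map
```

The mapped version also adds a legend with the single entry "red", which is usually a sign that a setting ended up inside aes() by mistake.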

Recall: Setting Attributes

Here we set three attributes of the points: alpha, size, color

library(ggplot2)
library(gapminder)
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) + 
        scale_x_log10() +
        geom_point(alpha=0.5, size=3, color="red")

Mapping onto shape

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, shape=continent)) + 
        geom_point(size=3, alpha=0.3) + scale_x_log10()

Mapping onto size

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, size=pop)) + 
        scale_x_log10() +
        geom_point(alpha=0.3)

Recall: Combining mappings

library(dplyr)   # provides filter(), used on later slides
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ggplot(subset(gapminder, continent %in% c("Americas","Europe")), 
       aes(x = gdpPercap, y = lifeExp, size=year, 
           color=continent, shape=continent)) + 
       scale_x_log10() + geom_point(alpha=0.3)

Typical aesthetics

Aesthetic   Description
x           X axis position
y           Y axis position
colour      Colour of dots, outlines of other shapes
fill        Fill colour
size        Diameter of points, thickness of lines
alpha       Transparency
linetype    Line dash pattern
label       Text on a plot or axes
shape       Shape

Aesthetics and Geoms

  • Each geom layer lets you set the aesthetics that make sense for that particular geom.
  • For example, geom_point() understands the following aesthetics: x, y, alpha, color, fill, group, shape, size, stroke. For geom_point() the aesthetics x and y are required.
  • Some aesthetics are limited to continuous variables, others to categorical variables.
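The restriction shows up as an error when the plot is built. This is a minimal sketch on a made-up data frame (toy and its columns are hypothetical): mapping a numeric variable to shape fails, while converting it to a factor works.

```r
library(ggplot2)

toy <- data.frame(x = 1:4,
                  y = c(2, 4, 3, 5),
                  year = c(1997, 2002, 2007, 2007))

# shape needs a categorical variable; a numeric one fails at build time
p_bad <- ggplot(toy, aes(x = x, y = y, shape = year)) + geom_point()
result <- tryCatch(ggplot_build(p_bad), error = function(e) e)
shape_failed <- inherits(result, "error")

# Converting to a factor makes the mapping valid: one shape per year
p_ok <- ggplot(toy, aes(x = x, y = y, shape = factor(year))) + geom_point()
n_shapes <- length(unique(layer_data(p_ok, 1)$shape))

shape_failed
n_shapes
```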

Aesthetics - Continuous Variables

Aesthetic   Description
x           X axis position
y           Y axis position
colour      Colour of dots, outlines of other shapes
fill        Fill colour
size        Diameter of points, thickness of lines
alpha       Transparency
linetype    Line dash pattern
label       Text on a plot or axes
shape       Shape

  • linetype and shape cannot be mapped to continuous variables; the other aesthetics can.

Aesthetics - Continuous Variables

ggplot(filter(gapminder,year==2007), 
  aes(x = gdpPercap, y = lifeExp, size=pop)) +
  scale_x_log10() + geom_point(alpha=0.3) + 
  scale_size_continuous(name="pop", range = c(1,20))

Aesthetics - Continuous Variables

d <- filter(gapminder, year %in% c(1967,1977,1987,1997,2007))
ggplot(d, aes(x = gdpPercap, y = lifeExp, color=pop)) + 
  scale_x_log10() + geom_point(alpha=0.3, size=3)

Aesthetics - Continuous Variables

  • Size clearly works better than color in this case.
  • There are general guidelines about which aesthetics work better for which kinds of variables. These are rooted in our understanding of visual perception, as we have seen earlier.

Aesthetics - Categorical Variables - Mapping onto shape

ggplot(d, aes(x = gdpPercap, y = lifeExp, shape=continent)) +
  scale_x_log10() + geom_point(alpha=0.3, size=4)

Adding redundant channel to emphasize

ggplot(d, aes(x = gdpPercap, y = lifeExp, shape=continent)) +
  scale_x_log10() + geom_point(alpha=0.3, size=4) +
  geom_point(data=filter(d, continent=="Americas"), 
   color="red", alpha=0.5, size=4) + theme(legend.position="none")

Encircle to emphasize

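# The ggalt package provides geom_encircle()
# devtools::install_github("hrbrmstr/ggalt", force=FALSE)
library(ggalt)
# Store the plot from the previous slide so that we can add to it
previousplot <- ggplot(d, aes(x = gdpPercap, y = lifeExp, shape=continent)) +
  scale_x_log10() + geom_point(alpha=0.3, size=4) +
  geom_point(data=filter(d, continent=="Americas"), 
   color="red", alpha=0.5, size=4) + theme(legend.position="none")
previousplot + geom_encircle(data=filter(d, country=="United States"), 
                expand=0.05, color="blue", linetype=2, size=2)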

Connect to emphasize

library(ggthemes)
ggplot(d, aes(x = gdpPercap, y = lifeExp, shape=continent)) +
  scale_x_log10() + 
  geom_path(data=filter(d, country=="United States"), 
            color="light blue", linetype=1, size=6) +
  geom_path(data=filter(d, country=="Venezuela"), 
            color="light green", linetype=1, size=6) +
  geom_path(data=filter(d, country=="Haiti"), 
            color="orange", linetype=1, size=6) +
  geom_point(alpha=0.3, size=4) +
  geom_point(data=filter(d, continent=="Americas"), 
              color="red", alpha=0.5, size=4) + 
  # apply the complete theme first; otherwise it resets legend.position
  theme_tufte() + theme(legend.position="none") +
  annotate("text", x = c(40000), y = c(73), size=6, 
           color="dark blue", label = c("United States")) +
  annotate("text", x = c(13000), y = c(63), size=6, 
           color="dark green", label = c("Venezuela")) +
  annotate("text", x = c(1200), y = c(62), size=6, 
           color="dark orange", label = c("Haiti"))

Box plots and Dot Plots

  • For some plots we have a specific geom. E.g., box plots are created with geom_boxplot().
  • For other plots we can use the geoms we already know. E.g., for dot plots we can use geom_point().
  • ggplot2 offers 37 geoms overall, but it is enough to know a few well. Use the ggplot2 cheat sheet.

Examples: Geoms and Type of Plot

Name of Plot           Geom      Other Features
scatterplot            point
bubblechart            point     size mapped to a variable
barchart               bar
box-and-whisker plot   boxplot
line chart             line

A New Dataset - Organ Donors

organs <- read.csv("organ_donors.csv")

dim(organs)
## [1] 238  21
head(organs)
# For convenience, let R know that year is a time measure.
organs$year <- as.Date(strptime(organs$year, format="%Y"))
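When strptime() is given only "%Y", it fills in the missing month and day from the current date, so the resulting dates depend on when the code is run. A hedged alternative, sketched here on a hypothetical vector of years standing in for organs$year, pins every year to 1 January.

```r
# Hypothetical vector of years, standing in for organs$year
year <- c(1991, 1992, 2002)

# Pin each year to 1 January so the result does not depend on today's date
year_date <- as.Date(paste0(year, "-01-01"))

format(year_date, "%Y-%m-%d")
```

Either version gives scale_x_date() a proper Date column; the pinned version simply makes the plot reproducible from day to day.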

Let’s take a quick look

  • Let’s explore the data a bit with some plots.
p <- ggplot(data=organs,
            aes(x=year,
                y=donors))

p + geom_point()
## Warning: Removed 34 rows containing missing values (geom_point).

Some lineplots again

p + geom_line(aes(group=country,
                  color=consent.law)) +
    scale_color_manual(values=c("gray40", "firebrick")) +
    scale_x_date() + 
    labs(x="Year",
         y="Donors",
         color="Consent Law") +
    theme(legend.position="top")
## Warning: Removed 34 rows containing missing values (geom_path).

Faceting

  • We can also split the plot by some factor, called faceting
  • ggplot2 has two faceting functions that do slightly different things: facet_grid(), seen here, and facet_wrap(). Try them out on the Gapminder data.

p + geom_line(aes(group=country)) +
    labs(x="Year",
         y="Donors") +
    facet_grid(.~consent.law)
## Warning: Removed 34 rows containing missing values (geom_path).
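To compare the two faceting functions side by side, here is a minimal sketch on a made-up data frame (toy, x, y, and g are hypothetical names). Both functions create one panel per level of g; they differ in how the panels are arranged.

```r
library(ggplot2)

# Made-up data: three groups of three points each
toy <- data.frame(x = rep(1:3, times = 3),
                  y = c(1, 2, 3, 2, 2, 2, 3, 2, 1),
                  g = rep(c("a", "b", "c"), each = 3))

base <- ggplot(toy, aes(x = x, y = y)) + geom_line()

# facet_grid(. ~ g): a single row with one column per level of g
p_grid <- base + facet_grid(. ~ g)

# facet_wrap(~ g, ncol = 2): panels wrap onto a second row
p_wrap <- base + facet_wrap(~ g, ncol = 2)

# Count the panels each version draws
n_panels_grid <- length(unique(layer_data(p_grid, 1)$PANEL))
n_panels_wrap <- length(unique(layer_data(p_wrap, 1)$PANEL))
c(grid = n_panels_grid, wrap = n_panels_wrap)
```

facet_grid() suits layouts defined by one or two variables in rows and columns, while facet_wrap() is convenient when a single variable has many levels.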

A quick bit of data manipulation - Average by group

library(dplyr)
by.country <- organs %>% group_by(consent.law, country) %>%
    summarize(donors=mean(donors, na.rm = TRUE))
by.country
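What the pipeline computes is easiest to see on a tiny example. This is a minimal sketch on a made-up data frame (toy and its values are hypothetical) with the same column names. The .groups = "drop" argument is an addition that returns an ungrouped result.

```r
library(dplyr)

# Made-up data: two countries, one missing value
toy <- data.frame(
  consent.law = c("Informed", "Informed", "Presumed", "Presumed"),
  country     = c("A", "A", "B", "B"),
  donors      = c(10, NA, 20, 30)
)

# One row per (consent.law, country) pair with the mean donor rate;
# na.rm = TRUE ignores the missing value instead of returning NA
by_country_toy <- toy %>%
  group_by(consent.law, country) %>%
  summarize(donors = mean(donors, na.rm = TRUE), .groups = "drop")

by_country_toy
```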

Ordered dotplots

p <- ggplot(by.country, aes(x=donors, y=country, color=consent.law))
p + geom_point(size=3)

  • Note, we are using geom_point() again.
  • How can we improve this graph?

Ordering

We know that order helps visual perception.

p <- ggplot(by.country, aes(x=donors, y=reorder(country,donors), 
                            color=consent.law))
p + geom_point(size=3)

  • Get your factors (the categorical variable) in order when it makes sense.
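reorder() works by changing the order of the factor levels, which is what ggplot2 uses to place categories along an axis. This is a minimal sketch with made-up values (country and donors here are hypothetical vectors, not the organ donor data).

```r
# Made-up data: country A has a missing value
country <- c("A", "A", "B", "B", "C")
donors  <- c(30, NA, 10, 12, 20)

# Default factor levels are alphabetical
default_levels <- levels(factor(country))

# reorder() sorts the levels by the mean of donors within each country;
# na.rm = TRUE is passed on to mean()
ordered_levels <- levels(reorder(factor(country), donors, na.rm = TRUE))

default_levels
ordered_levels
```

The group means are 30 for A, 11 for B, and 20 for C, so the levels come out in ascending order as B, C, A.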

Improve the labels

p + geom_point(size=3) +
    labs(x="Donor Procurement Rate (per million population)",
         y="", color="Consent Law") +
    theme(legend.position="top")

Another way

p <- ggplot(by.country, aes(x=donors, y=reorder(country, donors)))
p + geom_point(size=3) +
    facet_grid(consent.law ~ ., scales="free") +
    labs(x="Donor Procurement Rate (per million population)",
         y="",
         color="Consent Law") +
    theme(legend.position="top")

Boxplot

p <- ggplot(data=organs,aes(x=country,y=donors)) 
p + geom_boxplot() +
    coord_flip() +   # This is one way to get a horizontal box plot
    labs(x="", y="Donor Procurement Rate")
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

Boxplot

p <- ggplot(data=organs,aes(x=reorder(country, donors, na.rm=TRUE), y=donors)) 
p + geom_boxplot() + coord_flip() +
    labs(x="", y="Donor Procurement Rate")
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

Boxplot

p <- ggplot(data=organs,aes(x=reorder(country, donors, na.rm=TRUE),y=donors)) 
p + geom_boxplot(aes(fill=consent.law)) +
    coord_flip() + labs(x="", y="Donor Procurement Rate")
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

Boxplot - Add some jitter

# Can combine jitter and boxplot if needed
ggplot(data=organs,aes(x=reorder(country, donors, na.rm=TRUE),y=donors)) + 
  geom_boxplot(aes(fill=consent.law), outlier.colour="transparent", alpha=0.3) +
  coord_flip() + labs(x="", y="Donor Procurement Rate") +
  geom_jitter(shape=21, aes(fill=consent.law), color="black",
              position=position_jitter(width=0.1))
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
## Warning: Removed 34 rows containing missing values (geom_point).

1-D point summaries

p <- ggplot(data=organs, aes(x=reorder(country, donors, na.rm=TRUE), y=donors)) 
p + geom_point(aes(color=consent.law)) +
    coord_flip() + labs(x="", y="Donor Procurement Rate")
## Warning: Removed 34 rows containing missing values (geom_point).

Add a little jitter

p <- ggplot(data=organs,aes(x=reorder(country, donors, na.rm=TRUE), y=donors)) 
p + geom_jitter(aes(color=consent.law)) + coord_flip() + 
        labs(x="", y="Donor Procurement Rate")
## Warning: Removed 34 rows containing missing values (geom_point).

Fine-tune the jittering

p <- ggplot(data=organs, aes(x=reorder(country, assault, na.rm=TRUE), y=assault)) 
p + geom_jitter(aes(color=world),
                position = position_jitter(width=0.15)) +
    coord_flip() +
    labs(x="", y="Assault") +
    theme(legend.position="top")
## Warning: Removed 17 rows containing missing values (geom_point).

A few more useful geoms

p + geom_point() + ggtitle("point")
p + geom_text() + ggtitle("text")
p + geom_bar(stat = "identity") + ggtitle("bar") 
p + geom_tile() + ggtitle("raster")
p + geom_line() + ggtitle("line")
p + geom_area() + ggtitle("area")
p + geom_path() + ggtitle("path")
p + geom_polygon() + ggtitle("polygon")

Familiarize yourself with these options in ggplot2.

Thank you!